Artificial Neural Networks (MLPs) for Tabular Data#

An artificial neural network for tabular data is usually a multi-layer perceptron (MLP): stacks of Linear layers with nonlinear activations (ReLU, GELU, …).

MLPs are a great learning tool because you can understand them end-to-end:

  • a forward pass is just matrix multiplications + activations

  • training is “just” gradient descent on a loss (via backprop)

On many real-world tabular problems, gradient-boosted tree models (XGBoost/LightGBM/CatBoost) are often the strongest baseline. MLPs tend to shine when you have lots of data, when high-cardinality categorical features benefit from learned embeddings, or when you need to combine tabular inputs with other modalities.


Learning goals#

By the end, you should be able to:

  • explain how an MLP turns features into predictions

  • implement a 2-layer MLP in NumPy (forward + backprop)

  • train it with mini-batch SGD and visualize learning curves

  • build the same model in PyTorch and compare results

  • diagnose common tabular-MLP pitfalls (scaling, overfitting, LR)

Notation (quick)#

  • Features: \(X \in \mathbb{R}^{n\times d}\) (rows are samples)

  • Labels (binary): \(y \in \{0,1\}^n\)

  • First layer: \(z_1 = XW_1 + b_1\), \(a_1 = \mathrm{ReLU}(z_1)\)

  • Output logits: \(\ell = a_1W_2 + b_2\) (probability via sigmoid)
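To make the notation concrete, here is a minimal NumPy sketch of the forward pass with small, hypothetical sizes (`n=5` samples, `d=3` features, `h=4` hidden units), just to check the shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 3, 4  # toy sizes, chosen only for illustration

X = rng.normal(size=(n, d))
W1 = rng.normal(size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(size=(h, 1)); b2 = np.zeros(1)

z1 = X @ W1 + b1                   # (n, h)
a1 = np.maximum(0.0, z1)           # ReLU
logits = a1 @ W2 + b2              # (n, 1)
p = 1.0 / (1.0 + np.exp(-logits))  # sigmoid -> probabilities in (0, 1)

print(z1.shape, logits.shape)  # (5, 4) (5, 1)
```

Each row of `X` flows through both layers independently, so the batch dimension `n` survives every step while the feature dimension changes `d → h → 1`.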


Table of contents#

  1. What makes tabular data special?

  2. A tiny nonlinear dataset + why scaling matters

  3. Baseline: logistic regression (linear boundary)

  4. From scratch: a 2-layer MLP in NumPy

  5. Practical: the same model in PyTorch

  6. Compare models + diagnostics

  7. Practical tips for real tabular data

  8. Exercises + references

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio

from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, log_loss
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

pio.templates.default = "plotly_white"
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)

SEED = 42
rng = np.random.default_rng(SEED)

import warnings

torch.manual_seed(SEED)
with warnings.catch_warnings():
    warnings.filterwarnings("ignore", message="CUDA initialization.*", category=UserWarning)
    has_cuda = torch.cuda.is_available()
if has_cuda:
    torch.cuda.manual_seed_all(SEED)

device = torch.device("cuda" if has_cuda else "cpu")
device
device(type='cpu')

1) What makes tabular data special?#

Tabular data usually means:

  • each row is an entity (customer, transaction, patient)

  • columns are heterogeneous features (numeric + categorical + missing)

Compared to images/text, tabular datasets are often smaller and noisier, and the “right” inductive bias is less obvious.

For MLPs specifically, two habits matter a lot:

  • standardize numeric features (helps optimization)

  • treat categorical features carefully (often via embeddings)
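The first habit is easy to see with a quick sketch. Assume a hypothetical two-column table where one feature is on the scale of tens (say, age in years) and the other on the scale of tens of thousands (say, income in dollars):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical columns with very different scales
X = np.column_stack([
    rng.normal(40, 10, size=500),       # ~tens (e.g., age)
    rng.normal(50_000, 15_000, 500),    # ~tens of thousands (e.g., income)
])

print(X.std(axis=0))  # wildly different per-feature scales

Xz = StandardScaler().fit_transform(X)
print(Xz.mean(axis=0).round(6), Xz.std(axis=0).round(6))  # ~0 mean, unit std
```

Without standardization, the gradient with respect to the large-scale feature's weight dominates, which forces a tiny learning rate; after standardization, one learning rate works for all weights.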

2) A tiny nonlinear dataset + why scaling matters#

We’ll use a simple 2D dataset so we can visualize the decision boundary.

Even though it’s 2D, it’s still “tabular”: each row is a sample, and the two columns are features.

To make the scaling issue obvious, we’ll intentionally stretch one feature.

# Dataset
n_samples = 2000
X_raw, y = make_moons(n_samples=n_samples, noise=0.25, random_state=SEED)

# Force a scale mismatch (common in real tabular datasets)
X_raw = X_raw.astype(np.float64)
X_raw[:, 1] *= 3.0
y = y.astype(np.int64)

# Train/val/test split
X_train_raw, X_temp_raw, y_train, y_temp = train_test_split(
    X_raw,
    y,
    test_size=0.30,
    random_state=SEED,
    stratify=y,
)

X_val_raw, X_test_raw, y_val, y_test = train_test_split(
    X_temp_raw,
    y_temp,
    test_size=0.50,
    random_state=SEED,
    stratify=y_temp,
)

# Standardize using train split only
scaler = StandardScaler().fit(X_train_raw)
X_train = scaler.transform(X_train_raw)
X_val = scaler.transform(X_val_raw)
X_test = scaler.transform(X_test_raw)

X_train.shape, X_val.shape, X_test.shape
((1400, 2), (300, 2), (300, 2))
fig = px.scatter(
    x=X_raw[:, 0],
    y=X_raw[:, 1],
    color=y.astype(str),
    title="Raw features (note the scale mismatch)",
    labels={"x": "feature_1", "y": "feature_2", "color": "class"},
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.show()
X_all = scaler.transform(X_raw)
fig = px.scatter(
    x=X_all[:, 0],
    y=X_all[:, 1],
    color=y.astype(str),
    title="Standardized features (zero mean, unit variance)",
    labels={"x": "z(feature_1)", "y": "z(feature_2)", "color": "class"},
)
fig.update_traces(marker=dict(size=5, opacity=0.7))
fig.show()
def decision_boundary_figure(X2d, y, prob_fn, title, grid_n=250, pad=0.6):
    X2d = np.asarray(X2d)
    y = np.asarray(y)

    x0_min, x0_max = X2d[:, 0].min() - pad, X2d[:, 0].max() + pad
    x1_min, x1_max = X2d[:, 1].min() - pad, X2d[:, 1].max() + pad

    xs = np.linspace(x0_min, x0_max, grid_n)
    ys = np.linspace(x1_min, x1_max, grid_n)
    xx, yy = np.meshgrid(xs, ys)
    grid = np.c_[xx.ravel(), yy.ravel()]

    probs = prob_fn(grid).reshape(xx.shape)

    fig = go.Figure()

    # Probability surface
    fig.add_trace(
        go.Contour(
            x=xs,
            y=ys,
            z=probs,
            zmin=0.0,
            zmax=1.0,
            colorscale="RdBu",
            reversescale=True,
            opacity=0.75,
            colorbar=dict(title="P(class=1)"),
            contours=dict(start=0.0, end=1.0, size=0.1),
        )
    )

    # Decision boundary line at 0.5
    fig.add_trace(
        go.Contour(
            x=xs,
            y=ys,
            z=probs,
            contours=dict(start=0.5, end=0.5, size=0.5, coloring="lines"),
            line=dict(color="black", width=3),
            showscale=False,
        )
    )

    # Points
    fig.add_trace(
        go.Scatter(
            x=X2d[:, 0],
            y=X2d[:, 1],
            mode="markers",
            marker=dict(color=y, colorscale="Viridis", size=5, opacity=0.75),
            name="data",
        )
    )

    fig.update_layout(
        title=title,
        xaxis_title="feature_1 (standardized)",
        yaxis_title="feature_2 (standardized)",
        legend=dict(orientation="h", yanchor="bottom", y=1.02, xanchor="right", x=1),
    )
    return fig

3) Baseline: logistic regression (linear boundary)#

Logistic regression is a linear classifier: in 2D, its decision boundary is a single straight line.

Our dataset needs a curved boundary, so logistic regression should underfit.

log_reg = LogisticRegression(max_iter=2000, random_state=SEED)
log_reg.fit(X_train, y_train)

def eval_sklearn_binary(model, X, y):
    probs = model.predict_proba(X)[:, 1]
    preds = (probs >= 0.5).astype(np.int64)
    return {
        "acc": float(accuracy_score(y, preds)),
        "logloss": float(log_loss(y, probs)),
    }

baseline_metrics = {
    "train": eval_sklearn_binary(log_reg, X_train, y_train),
    "val": eval_sklearn_binary(log_reg, X_val, y_val),
    "test": eval_sklearn_binary(log_reg, X_test, y_test),
}
baseline_metrics
{'train': {'acc': 0.86, 'logloss': 0.3129837863524164},
 'val': {'acc': 0.8666666666666667, 'logloss': 0.29755669561639714},
 'test': {'acc': 0.89, 'logloss': 0.27348769453997146}}
fig = decision_boundary_figure(
    X_train,
    y_train,
    prob_fn=lambda X: log_reg.predict_proba(X)[:, 1],
    title="Logistic regression decision boundary (linear)",
)
fig.show()

4) From scratch: a 2-layer MLP in NumPy#

A 2-layer MLP is:

  1. a linear layer that mixes the input features

  2. a nonlinearity (ReLU)

  3. another linear layer to produce a logit

Even this small network can produce a piecewise-linear decision boundary that bends around the data.

Forward pass (binary classification)#

Hidden layer:

\[ z_1 = XW_1 + b_1\quad\Rightarrow\quad a_1 = \mathrm{ReLU}(z_1) \]

Output logit:

\[ \ell = a_1 W_2 + b_2 \]

Probability:

\[ p = \sigma(\ell) = \frac{1}{1 + e^{-\ell}} \]

Loss (binary cross-entropy, computed stably from logits):

\[ \mathcal{L} = \frac{1}{n}\sum_i \left[\log(1+e^{\ell_i}) - y_i\ell_i\right] \]

Key gradient fact (per sample; the \(1/n\) from the mean is applied in code):

\[ \frac{\partial}{\partial \ell_i}\left[\log(1+e^{\ell_i}) - y_i\ell_i\right] = \sigma(\ell_i) - y_i \]
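Before trusting this identity in backprop, it is worth verifying it numerically. A minimal self-contained check, comparing the analytic gradient against a central finite difference at a few hypothetical logit/label pairs:

```python
import numpy as np

def per_sample_loss(logit, y):
    # BCE from a logit, computed stably: log(1 + e^l) - y*l
    return np.logaddexp(0.0, logit) - y * logit

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

eps = 1e-6
for logit, y in [(-2.0, 0.0), (0.5, 1.0), (3.0, 0.0)]:
    numeric = (per_sample_loss(logit + eps, y) - per_sample_loss(logit - eps, y)) / (2 * eps)
    analytic = sigmoid(logit) - y
    assert abs(numeric - analytic) < 1e-5
print("gradient fact verified")
```

This same finite-difference trick generalizes to checking `mlp_loss_and_grads` below: perturb one weight, re-run the forward pass, and compare against the analytic gradient.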
def relu(x):
    return np.maximum(0.0, x)


def sigmoid(x):
    # Note: np.exp(-x) can overflow for very negative x; fine for this demo,
    # but use scipy.special.expit for a fully stable version.
    return 1.0 / (1.0 + np.exp(-x))


def bce_with_logits_loss(logits, y):
    """Mean binary cross-entropy, but computed stably from logits.

    logits: (n, 1)
    y:      (n, 1) in {0,1}
    """
    logits = np.asarray(logits)
    y = np.asarray(y)
    return float((np.logaddexp(0.0, logits) - y * logits).mean())


def accuracy_from_logits(logits, y):
    probs = sigmoid(logits)
    preds = (probs >= 0.5).astype(np.int64)
    return float((preds.ravel() == y.ravel()).mean())
def init_mlp(in_dim, hidden_dim, rng):
    """He initialization is a good default for ReLU networks."""
    W1 = rng.normal(0.0, np.sqrt(2.0 / in_dim), size=(in_dim, hidden_dim))
    b1 = np.zeros((hidden_dim,), dtype=np.float64)

    W2 = rng.normal(0.0, np.sqrt(2.0 / hidden_dim), size=(hidden_dim, 1))
    b2 = np.zeros((1,), dtype=np.float64)

    return {"W1": W1, "b1": b1, "W2": W2, "b2": b2}


def mlp_forward(X, params):
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]

    z1 = X @ W1 + b1
    a1 = relu(z1)
    logits = a1 @ W2 + b2

    cache = {"X": X, "z1": z1, "a1": a1}
    return logits, cache


def mlp_loss_and_grads(X, y, params, weight_decay=0.0):
    """Return loss and gradients for a 2-layer MLP."""
    y = y.reshape(-1, 1).astype(np.float64)
    logits, cache = mlp_forward(X, params)

    loss = bce_with_logits_loss(logits, y)
    if weight_decay:
        loss += 0.5 * weight_decay * (np.sum(params["W1"] ** 2) + np.sum(params["W2"] ** 2))

    probs = sigmoid(logits)
    n = X.shape[0]

    # dL/dlogits = (sigmoid(logits) - y) / n
    dlogits = (probs - y) / n

    dW2 = cache["a1"].T @ dlogits
    db2 = dlogits.sum(axis=0)

    da1 = dlogits @ params["W2"].T
    dz1 = da1 * (cache["z1"] > 0.0)

    dW1 = cache["X"].T @ dz1
    db1 = dz1.sum(axis=0)

    if weight_decay:
        dW1 = dW1 + weight_decay * params["W1"]
        dW2 = dW2 + weight_decay * params["W2"]

    grads = {"W1": dW1, "b1": db1, "W2": dW2, "b2": db2}
    return loss, grads
def train_numpy_mlp(
    X_train,
    y_train,
    X_val,
    y_val,
    *,
    hidden_dim=32,
    lr=0.1,
    epochs=200,
    batch_size=128,
    weight_decay=1e-4,
    seed=SEED,
):
    rng_local = np.random.default_rng(seed)
    params = init_mlp(in_dim=X_train.shape[1], hidden_dim=hidden_dim, rng=rng_local)

    history = {
        "epoch": [],
        "train_loss": [],
        "val_loss": [],
        "train_acc": [],
        "val_acc": [],
    }

    y_train_col = y_train.reshape(-1, 1)
    y_val_col = y_val.reshape(-1, 1)

    for epoch in range(1, epochs + 1):
        idx = rng_local.permutation(X_train.shape[0])

        for start in range(0, X_train.shape[0], batch_size):
            batch_idx = idx[start : start + batch_size]
            Xb = X_train[batch_idx]
            yb = y_train_col[batch_idx]

            _, grads = mlp_loss_and_grads(Xb, yb, params, weight_decay=weight_decay)

            params["W1"] -= lr * grads["W1"]
            params["b1"] -= lr * grads["b1"]
            params["W2"] -= lr * grads["W2"]
            params["b2"] -= lr * grads["b2"]

        train_logits, _ = mlp_forward(X_train, params)
        val_logits, _ = mlp_forward(X_val, params)

        train_loss = bce_with_logits_loss(train_logits, y_train_col)
        val_loss = bce_with_logits_loss(val_logits, y_val_col)

        train_acc = accuracy_from_logits(train_logits, y_train_col)
        val_acc = accuracy_from_logits(val_logits, y_val_col)

        history["epoch"].append(epoch)
        history["train_loss"].append(train_loss)
        history["val_loss"].append(val_loss)
        history["train_acc"].append(train_acc)
        history["val_acc"].append(val_acc)

    return params, history
params_np, hist_np = train_numpy_mlp(
    X_train,
    y_train,
    X_val,
    y_val,
    hidden_dim=32,
    lr=0.1,
    epochs=200,
    batch_size=128,
    weight_decay=1e-4,
)

def eval_numpy_mlp(params, X, y):
    logits, _ = mlp_forward(X, params)
    probs = sigmoid(logits).ravel()
    preds = (probs >= 0.5).astype(np.int64)
    return {
        "acc": float(accuracy_score(y, preds)),
        "logloss": float(log_loss(y, probs)),
    }

numpy_metrics = {
    "train": eval_numpy_mlp(params_np, X_train, y_train),
    "val": eval_numpy_mlp(params_np, X_val, y_val),
    "test": eval_numpy_mlp(params_np, X_test, y_test),
}
numpy_metrics
{'train': {'acc': 0.9478571428571428, 'logloss': 0.14365309037199106},
 'val': {'acc': 0.9333333333333333, 'logloss': 0.15531360103865552},
 'test': {'acc': 0.95, 'logloss': 0.14805155593491454}}
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["train_loss"], name="train"))
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["val_loss"], name="val"))
fig.update_layout(
    title="NumPy MLP: loss over epochs",
    xaxis_title="epoch",
    yaxis_title="binary cross-entropy",
)
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["train_acc"], name="train"))
fig.add_trace(go.Scatter(x=hist_np["epoch"], y=hist_np["val_acc"], name="val"))
fig.update_layout(
    title="NumPy MLP: accuracy over epochs",
    xaxis_title="epoch",
    yaxis_title="accuracy",
    yaxis=dict(range=[0.0, 1.0]),
)
fig.show()
fig = decision_boundary_figure(
    X_train,
    y_train,
    prob_fn=lambda X: sigmoid(mlp_forward(X, params_np)[0]).ravel(),
    title="NumPy MLP decision boundary (nonlinear)",
)
fig.show()
probs_np_test = sigmoid(mlp_forward(X_test, params_np)[0]).ravel()
preds_np_test = (probs_np_test >= 0.5).astype(np.int64)
cm = confusion_matrix(y_test, preds_np_test)

fig = px.imshow(
    cm,
    text_auto=True,
    color_continuous_scale="Blues",
    title="NumPy MLP: confusion matrix (test)",
    labels=dict(x="predicted", y="true", color="count"),
)
fig.update_xaxes(tickmode="array", tickvals=[0, 1])
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()

5) Practical: the same model in PyTorch#

PyTorch gives you:

  • automatic differentiation (no manual backprop)

  • battle-tested optimizers (Adam, SGD+momentum)

  • easy batching with DataLoader

We’ll build the same architecture and train it on the same standardized data.

X_train_t = torch.tensor(X_train, dtype=torch.float32)
y_train_t = torch.tensor(y_train.reshape(-1, 1), dtype=torch.float32)
X_val_t = torch.tensor(X_val, dtype=torch.float32)
y_val_t = torch.tensor(y_val.reshape(-1, 1), dtype=torch.float32)
X_test_t = torch.tensor(X_test, dtype=torch.float32)
y_test_t = torch.tensor(y_test.reshape(-1, 1), dtype=torch.float32)

train_loader = DataLoader(TensorDataset(X_train_t, y_train_t), batch_size=128, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_t, y_val_t), batch_size=256, shuffle=False)

torch_model = nn.Sequential(
    nn.Linear(X_train.shape[1], 32),
    nn.ReLU(),
    nn.Linear(32, 1),
).to(device)

criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(torch_model.parameters(), lr=0.03, weight_decay=1e-4)

def run_epoch(model, loader, *, train=False):
    if train:
        model.train()
    else:
        model.eval()

    total_loss = 0.0
    total_correct = 0.0
    n = 0

    for xb, yb in loader:
        xb = xb.to(device)
        yb = yb.to(device)

        # Build the autograd graph only when training; skip it during eval
        with torch.set_grad_enabled(train):
            logits = model(xb)
            loss = criterion(logits, yb)

        if train:
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            probs = torch.sigmoid(logits)
            preds = (probs >= 0.5).float()
            total_correct += (preds == yb).float().sum().item()

        total_loss += loss.item() * xb.shape[0]
        n += xb.shape[0]

    return total_loss / n, total_correct / n
torch_hist = {
    "epoch": [],
    "train_loss": [],
    "val_loss": [],
    "train_acc": [],
    "val_acc": [],
}

epochs = 120
for epoch in range(1, epochs + 1):
    train_loss, train_acc = run_epoch(torch_model, train_loader, train=True)
    val_loss, val_acc = run_epoch(torch_model, val_loader, train=False)

    torch_hist["epoch"].append(epoch)
    torch_hist["train_loss"].append(float(train_loss))
    torch_hist["val_loss"].append(float(val_loss))
    torch_hist["train_acc"].append(float(train_acc))
    torch_hist["val_acc"].append(float(val_acc))

@torch.no_grad()
def torch_predict_proba(model, X):
    model.eval()
    Xt = torch.tensor(X, dtype=torch.float32, device=device)
    probs = torch.sigmoid(model(Xt)).detach().cpu().numpy().ravel()
    return probs

probs_torch_test = torch_predict_proba(torch_model, X_test)
preds_torch_test = (probs_torch_test >= 0.5).astype(np.int64)

probs_torch_train = torch_predict_proba(torch_model, X_train)
probs_torch_val = torch_predict_proba(torch_model, X_val)

torch_metrics = {
    "train": {
        "acc": float(accuracy_score(y_train, (probs_torch_train >= 0.5).astype(np.int64))),
        "logloss": float(log_loss(y_train, probs_torch_train)),
    },
    "val": {
        "acc": float(accuracy_score(y_val, (probs_torch_val >= 0.5).astype(np.int64))),
        "logloss": float(log_loss(y_val, probs_torch_val)),
    },
    "test": {
        "acc": float(accuracy_score(y_test, preds_torch_test)),
        "logloss": float(log_loss(y_test, probs_torch_test)),
    },
}
torch_metrics
{'train': {'acc': 0.95, 'logloss': 0.13017641005047428},
 'val': {'acc': 0.93, 'logloss': 0.1488189262504534},
 'test': {'acc': 0.95, 'logloss': 0.14634598848775623}}
fig = go.Figure()
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["train_loss"], name="train"))
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["val_loss"], name="val"))
fig.update_layout(
    title="PyTorch MLP: loss over epochs",
    xaxis_title="epoch",
    yaxis_title="binary cross-entropy",
)
fig.show()
fig = go.Figure()
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["train_acc"], name="train"))
fig.add_trace(go.Scatter(x=torch_hist["epoch"], y=torch_hist["val_acc"], name="val"))
fig.update_layout(
    title="PyTorch MLP: accuracy over epochs",
    xaxis_title="epoch",
    yaxis_title="accuracy",
    yaxis=dict(range=[0.0, 1.0]),
)
fig.show()
fig = decision_boundary_figure(
    X_train,
    y_train,
    prob_fn=lambda X: torch_predict_proba(torch_model, X),
    title="PyTorch MLP decision boundary (nonlinear)",
)
fig.show()
cm = confusion_matrix(y_test, preds_torch_test)
fig = px.imshow(
    cm,
    text_auto=True,
    color_continuous_scale="Blues",
    title="PyTorch MLP: confusion matrix (test)",
    labels=dict(x="predicted", y="true", color="count"),
)
fig.update_xaxes(tickmode="array", tickvals=[0, 1])
fig.update_yaxes(tickmode="array", tickvals=[0, 1])
fig.show()

6) Compare models + diagnostics#

On this toy dataset, both MLPs should learn a nonlinear boundary and outperform logistic regression.

We’ll compare test accuracy and log loss (probabilistic quality).

models = ["log_reg", "numpy_mlp", "torch_mlp"]
test_acc = [
    baseline_metrics["test"]["acc"],
    numpy_metrics["test"]["acc"],
    torch_metrics["test"]["acc"],
]
test_logloss = [
    baseline_metrics["test"]["logloss"],
    numpy_metrics["test"]["logloss"],
    torch_metrics["test"]["logloss"],
]

fig = go.Figure(go.Bar(x=models, y=test_acc))
fig.update_layout(title="Test accuracy", xaxis_title="model", yaxis_title="accuracy", yaxis=dict(range=[0.0, 1.0]))
fig.show()

fig = go.Figure(go.Bar(x=models, y=test_logloss))
fig.update_layout(title="Test log loss (lower is better)", xaxis_title="model", yaxis_title="log loss")
fig.show()

7) Practical tips for real tabular data#

  • Standardize numeric features (and keep the scaler fitted on train only).

  • Categorical features: try learned embeddings (nn.Embedding) instead of one-hot for high-cardinality columns.

  • Missing values: add missingness indicators; don’t just impute and hope.

  • Overfitting is common: use weight decay, dropout, early stopping, and a strong validation protocol.

  • Learning rate often matters more than architecture for small MLPs. When in doubt, sweep lr and use Adam.

  • Baselines first: compare against logistic regression and strong tree-based models.

  • Calibration: optimize log loss / calibration if your probabilities will drive decisions.
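The embedding tip above can be sketched as follows. This is a minimal illustration, not a full recipe; the column names and sizes (5 numeric features, a single categorical column with 1,000 levels mapped to an 8-dim embedding) are hypothetical:

```python
import torch
import torch.nn as nn

class TabularMLP(nn.Module):
    """MLP for mixed numeric + categorical inputs (toy sketch)."""

    def __init__(self, n_numeric=5, n_categories=1000, emb_dim=8, hidden=32):
        super().__init__()
        # Learned 8-dim embedding instead of a 1000-dim one-hot vector
        self.emb = nn.Embedding(n_categories, emb_dim)
        self.net = nn.Sequential(
            nn.Linear(n_numeric + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_num, x_cat):
        # x_num: (batch, n_numeric) floats; x_cat: (batch,) integer codes
        x = torch.cat([x_num, self.emb(x_cat)], dim=1)
        return self.net(x)  # logits, shape (batch, 1)

model = TabularMLP()
x_num = torch.randn(4, 5)
x_cat = torch.tensor([0, 17, 999, 3])
print(model(x_num, x_cat).shape)  # torch.Size([4, 1])
```

The embedding table is trained jointly with the rest of the network, so similar categories can end up with similar vectors; with several categorical columns, you would use one `nn.Embedding` per column and concatenate all of them with the numeric block.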

8) Exercises#

  1. Add another hidden layer (2 hidden layers total). Does it help? Does it overfit?

  2. Replace ReLU with tanh. What changes in training speed / final accuracy?

  3. Implement dropout in the NumPy model.

  4. Turn this into a multiclass problem (softmax + cross-entropy).

  5. Try a real tabular dataset (e.g., UCI) and compare with a tree baseline.

References#

  • PyTorch: https://pytorch.org/docs/stable/index.html

  • Goodfellow, Bengio, Courville — Deep Learning (MLP + backprop chapters)

  • scikit-learn MLPClassifier docs (for a practical baseline)